Reducing Storage Costs for Federated Search of Text Databases

Authors

  • Jie Lu
  • James P. Callan
Abstract

In environments containing many text search engines, a federated search system provides people with a single point of access. When search engines are managed by independent organizations, two key problems are discovering and representing the contents of each text database. Query-based sampling is a recent technique for discovering the contents of uncooperative databases in order to create database resource descriptions that support a variety of necessary capabilities. However, when the documents obtained by query-based sampling are very long, as is common in some government environments, disk storage costs can be surprisingly large. This paper investigates methods of pruning sampled documents to reduce storage costs. The experimental results demonstrate that disk storage costs can be reduced by 54-93% while causing only minor losses in federated search accuracy.

Similar resources

Federated search: Searching information across the AstraZeneca organisation

Finding information that is stored among many different databases has become a serious problem because of the increasing number of searchable databases on local area networks and on the Internet. Many large organisations undergoing change suffer from this problem due to a large number of databases and many different uncooperative search tools. By using federated search a single search interface provid...

Hierarchies of Indices for Text

We present an efficient implementation of a recently known index for text databases, when the database is stored on secondary storage devices such as magnetic or optical disks. The implementation is built on top of a new and simple index for texts called pat array (or suffix array). Considering that text searching in a large database spends most of the time accessing external storage devices, we ...

Optimized Binary Search and Text Retrieval

We present an algorithm that minimizes the expected cost of indirect binary search for data with non-constant access costs, such as disk data. Indirect binary search means that sorted access to the data is obtained through an array of pointers to the raw data. One immediate application of this algorithm is to improve the retrieval performance of disk databases that are indexed using the suffix ar...

Federated Search of Text-Based Digital Libraries in Hierarchical Peer-to-Peer Networks

Peer-to-peer architectures are a potentially powerful model for developing large-scale networks of text-based digital libraries, but peer-to-peer networks have so far provided very limited support for text-based federated search of digital libraries using relevance-based ranking. This paper addresses the problems of resource representation, resource ranking and selection, and result merging for ...

Publication date: 2003